{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "fFIsBLlF7YFv"
   },
   "source": [
    "# A Simple Case Study using Wage Data from 2015"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ssGRQl-d7U9O"
   },
   "source": [
    "We illustrate how to predict an outcome variable $Y$ in a high-dimensional setting, where the number of covariates $p$ is large in relation to the sample size $n$. We use linear prediction rules for estimation, including OLS and the penalized linear methods we've studied. Later, we will also consider nonlinear prediction rules including tree-based methods and neural nets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "57TFoHNk8BIg"
   },
   "outputs": [],
   "source": [
    "# Import relevant packages\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.model_selection import cross_val_score, KFold\n",
    "from sklearn.metrics import r2_score\n",
    "from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV, LinearRegression, Ridge\n",
    "import patsy\n",
    "import warnings\n",
    "from sklearn.base import BaseEstimator\n",
    "import statsmodels.api as sm\n",
    "warnings.simplefilter('ignore')\n",
    "np.random.seed(1234)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "yYmcd6mN7VCV"
   },
   "source": [
    "## Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1n1-LWsu53N6"
   },
   "source": [
    "Again, we consider data from the U.S. March Supplement of the Current Population Survey (CPS) in 2015.\n",
    "The preproccessed sample consists of $5150$ never-married individuals."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1or5aUNr7yTv"
   },
   "source": [
    "Set the following file_directory to a place where you downloaded https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/wage2015_subsample_inference.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "vk9EF6RNNK9r"
   },
   "outputs": [],
   "source": [
    "file = \"https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/wage2015_subsample_inference.csv\"\n",
    "data = pd.read_csv(file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "7pn1ukslNK9s"
   },
   "outputs": [],
   "source": [
    "data.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "fUxf95F1B2EE"
   },
   "outputs": [],
   "source": [
    "y = np.log(data['wage']).values\n",
    "Z = data.drop(['wage', 'lwage'], axis=1)\n",
    "Z.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "v2E-oxA8DQqH"
   },
   "source": [
    "The following figure shows the weekly wage distribution from the US survey data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "4QJc_mnlB2KN"
   },
   "outputs": [],
   "source": [
    "plt.hist(data.wage, bins=np.arange(0, 350, 20))\n",
    "plt.xlabel('hourly wage')\n",
    "plt.ylabel('Frequency')\n",
    "plt.title('Empirical wage distribution from the US survey data')\n",
    "plt.ylim((0, 3000))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "TvMhq0RNDL43"
   },
   "source": [
    "Wages show a high degree of skewness. Hence, wages are transformed in almost all studies by\n",
    "the logarithm."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "xBL1FmvgDV3f"
   },
   "source": [
    "## Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "nDbY6BAuDYVD"
   },
   "source": [
    "Due to the skewness of the data, we are considering log wages which leads to the following regression model\n",
    "\n",
    "$$\\log(\\operatorname{wage}) = g(Z) + \\epsilon.$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jNACoPwVDdpK"
   },
   "source": [
    "In this notebook, we will evaluate *linear* prediction rules. In later notebooks, we will also utilize nonlinear prediction methods. In linear models, we estimate the prediction rule of the form\n",
    "\n",
    "$$\\hat g(Z) = \\hat \\beta'X.$$\n",
    "\n",
    "Again, we generate $X$ in three ways:\n",
    "\n",
    "1. Basic Model:   $X$ consists of a set of raw regressors (e.g. gender, experience, education indicators, regional indicators).\n",
    "\n",
    "\n",
    "2. Flexible Model:  $X$ consists of all raw regressors from the basic model plus occupation and industry indicators, transformations (e.g., $\\operatorname{exp}^2$ and $\\operatorname{exp}^3$) and additional two-way interactions.\n",
    "\n",
    "3. Extra Flexible Model: $X$ takes the flexible model and takes all pairwise interactions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "iDM3apFhDlgf"
   },
   "source": [
    "To evaluate the out-of-sample performance, we split the data first."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "PPt0pZtgOMSn"
   },
   "source": [
    "As known from our first lab, the basic model consists of $51$ regressors, and the flexible model has $246$ regressors. Let us fit our models to the training sample using the two different model specifications. We are starting by running a simple ols regression and computing the $R^2$ on the test sample."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "n8Yzwj-tNK9v"
   },
   "source": [
    "### Low dimensional specification (basic)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "H133LWt2NK9w"
   },
   "outputs": [],
   "source": [
    "Zbase = patsy.dmatrix('0 + sex + exp1 + shs + hsg+ scl + clg + mw + so + we + C(occ2) + C(ind2)',\n",
    "                      Z, return_type='dataframe').values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "1cGOSULrNK9w"
   },
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(Zbase, y, test_size=0.25, random_state=123)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "mYNA5NqJNK9w"
   },
   "outputs": [],
   "source": [
    "lr_base = LinearRegression().fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "8yZ-FNUcNK9w"
   },
   "source": [
    "Let's calculate R-squared on the test set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "syCxRYoXNK9x"
   },
   "outputs": [],
   "source": [
    "r2_base = 1 - np.mean((y_test - lr_base.predict(X_test))**2) / np.var(y_test)\n",
    "print(f'{r2_base:.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "7xxH378TNK9y"
   },
   "source": [
    "In fact `sklearn` provides an implementation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "dFdiLG1PNK9y"
   },
   "outputs": [],
   "source": [
    "print(f'{r2_score(y_test, lr_base.predict(X_test)):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "RuM-FygCNK9y"
   },
   "source": [
    "Since out of sample performance can be vary for different train-test splits, it is more stable to look at average performance across multiple splits, using K-fold cross validation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "110qB5jmNK9z"
   },
   "outputs": [],
   "source": [
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "rsquares = cross_val_score(LinearRegression(), Zbase, y, scoring='r2', cv=cv)\n",
    "print(f'{np.mean(rsquares):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "J5fn7JDiNK9z"
   },
   "source": [
    "### High-dimensional specification (flexible)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jX09oKqhRJgz"
   },
   "source": [
    "We repeat the same procedure for the flexible model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "VMLyFWPJNK9z"
   },
   "outputs": [],
   "source": [
    "Zflex = patsy.dmatrix('0 + sex + (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+C(occ2)+C(ind2)+mw+so+we)',\n",
    "                      Z, return_type='dataframe').values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "wV05akw4ZzBP"
   },
   "outputs": [],
   "source": [
    "Zflex.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "bR37N6OPNK9z"
   },
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(Zflex, y, test_size=0.25, random_state=123)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "5Yh0GAGsNK9z"
   },
   "outputs": [],
   "source": [
    "lr_flex = LinearRegression().fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "3Ou7AeiONK90"
   },
   "outputs": [],
   "source": [
    "print(f'{r2_score(y_test, lr_flex.predict(X_test)):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "VAvIPK3fNK90"
   },
   "source": [
    "However, OLS can be quite unstable for such high-dimensional problems, and it really matters what solution is being returned among the multitude of solutions to the least squares objective -- the solution is non-unique in high-dimensional settings. For instance, we see that the `sklearn` implementation returns a numerically unstable solution whose error blows up in some cases."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "qoP4kyMTNK90"
   },
   "outputs": [],
   "source": [
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "rsquares = cross_val_score(LinearRegression(), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)\n",
    "print(f'{np.mean(rsquares):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "bGMdlPl4NK90"
   },
   "source": [
    "`sklearn`'s implementation uses the least squares solver from `scipy.linalg.lstsq`. If, for instance, we instead use a different pseudo-inverse based implementation, we get a different result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "xR5Q4Ub8NK91"
   },
   "outputs": [],
   "source": [
    "class MyOLS(BaseEstimator):\n",
    "\n",
    "    def fit(self, X, y):\n",
    "        X = np.hstack([np.ones((X.shape[0], 1)), X])\n",
    "        CXX = (X.T @ X) / X.shape[0]\n",
    "        CXy = (X.T @ y) / X.shape[0]\n",
    "        self.coef_ = np.linalg.pinv(CXX) @ CXy\n",
    "        return self\n",
    "\n",
    "    def predict(self, X):\n",
    "        X = np.hstack([np.ones((X.shape[0], 1)), X])\n",
    "        return X @ self.coef_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Up83GJNTNK91"
   },
   "outputs": [],
   "source": [
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "rsquares = cross_val_score(MyOLS(), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)\n",
    "print(f'{np.mean(rsquares):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "LxHM8DZNNK91"
   },
   "source": [
    "This procedure also recovers the solution provided by `statsmodels.api.OLS`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "8j2GR643NK91"
   },
   "outputs": [],
   "source": [
    "class StatsModelsOLS(BaseEstimator):\n",
    "\n",
    "    def fit(self, X, y):\n",
    "        X = np.hstack([np.ones((X.shape[0], 1)), X])\n",
    "        self.ols_ = sm.OLS(y, X).fit()\n",
    "        return self\n",
    "\n",
    "    def predict(self, X):\n",
    "        X = np.hstack([np.ones((X.shape[0], 1)), X])\n",
    "        return self.ols_.predict(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "wFSNrOSjNK92"
   },
   "outputs": [],
   "source": [
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "rsquares = cross_val_score(StatsModelsOLS(), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)\n",
    "print(f'{np.mean(rsquares):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jq5TwDZFNK92"
   },
   "source": [
    "We can also choose different solvers by using `sklearn.linear_model.Ridge` which allows for no penalty and a multitude of solvers. We see that the `lsqr` solver is more stable than solvers based on singular value decompositions of the covariance matrix $E_n[X X']$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "-AQ6S9ICNK92"
   },
   "outputs": [],
   "source": [
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "rsquares = cross_val_score(Ridge(alpha=0.0, solver='lsqr'), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)\n",
    "print(f'{np.mean(rsquares):.4f}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "-upj7kyMNK92"
   },
   "outputs": [],
   "source": [
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "rsquares = cross_val_score(Ridge(alpha=0.0, solver='cholesky'), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)\n",
    "print(f'{np.mean(rsquares):.4f}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "41Cco4TGNK92"
   },
   "outputs": [],
   "source": [
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "rsquares = cross_val_score(Ridge(alpha=0.0, solver='svd'), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)\n",
    "print(f'{np.mean(rsquares):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "JKqXwJhqNK93"
   },
   "source": [
    "### Penalized regressions (flexible model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "-4QL8R2OUbT_"
   },
   "source": [
    "We observe that ols regression works better for the basic model with smaller $p/n$ ratio. We proceed by running penalized regressions first for the flexible model, tuned via cross-validation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "9vwdoHe_NK93"
   },
   "source": [
    "First we try a pure `l1` penalty, tuned using cross-validation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "ANWxIjyqNK93"
   },
   "outputs": [],
   "source": [
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "rsquares = cross_val_score(LassoCV(cv=cv), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)\n",
    "print('Flexible model R^2 (Lasso): ', f'{np.mean(rsquares):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "p7fdRRERNK94"
   },
   "source": [
    "Oops! For penalized regressions it is important that our features have the same standard deviation, so that we are symmetrically penalizing them"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "bbslexlGNK94"
   },
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import StandardScaler\n",
    "Zflex = StandardScaler().fit_transform(Zflex)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "B4B068QgNK94"
   },
   "source": [
    "Let's try again!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "3HhaI0CDNK95"
   },
   "outputs": [],
   "source": [
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "rsquares = cross_val_score(LassoCV(cv=cv), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)\n",
    "print('Flexible model R^2 (Lasso): ', f'{np.mean(rsquares):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dDpwnCFFNK95"
   },
   "source": [
    "Then we try a pure `l2` penalty, tuned using cross-validation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "s_G1exWNNK96"
   },
   "outputs": [],
   "source": [
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "rsquares = cross_val_score(RidgeCV(cv=cv), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)\n",
    "print('Flexible model R^2 (Ridge): ', f'{np.mean(rsquares):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "VHGkDj4WNK96"
   },
   "source": [
    "Finally, we try an equal combination of the two penalties, with the overall weight tuned using cross validation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "e9ouiPUcNK97"
   },
   "outputs": [],
   "source": [
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "rsquares = cross_val_score(ElasticNetCV(cv=cv), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)\n",
    "print('Flexible model R^2 (Elastic Net): ', f'{np.mean(rsquares):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "BH9DIjrjNK97"
   },
   "source": [
    "We can also try a variant of the `l1` penalty, where the weight is chosen based on theoretical derivations. This is a based on a Python implementation that tries to replicate the main function of the hdm r-package. It was made by [Max Huppertz](https://maxhuppertz.github.io/code/). His library is this [repository](https://github.com/maxhuppertz/hdmpy). If not running on colab, download the repository and copy this folder to your site-packages folder. In my case, it is located here ***C:\\Python\\Python38\\Lib\\site-packages*** . It requires the multiprocess package ***pip install multiprocess***."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "f2N1EUC7T_8S"
   },
   "source": [
    "Specifically, we use \"plug-in\" tuning with a theoretically valid choice of penalty $\\lambda = 2 \\cdot c \\hat{\\sigma} \\sqrt{n} \\Phi^{-1}(1-\\alpha/2p)$, where $c>1$ and $1-\\alpha$ is a confidence level, $\\Phi^{-1}$ denotes the quantile function, and $\\hat{\\sigma}$ is estimated in an iterative manner (see corresponding notes in book). Under homoskedasticity, this choice ensures that the Lasso predictor is well behaved, delivering good predictive performance under approximate sparsity. In practice, this formula will work well even in the absence of homoskedasticity, especially when the random variables $\\epsilon$ and $X$ in the regression equation decay quickly at the tails.\n",
    "\n",
    "In practice, many people choose to use cross-validation, which is perfectly fine for predictive tasks. However, when conducting inference, to make our analysis valid we will require cross-fitting in addition to cross-validation. As we have not yet discussed cross-fitting, we rely on this theoretically-driven penalty in order to allow for accurate inference in the upcoming notebooks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "EuulnMD7UQIB"
   },
   "outputs": [],
   "source": [
    "!git clone https://github.com/maxhuppertz/hdmpy.git\n",
    "!pip install multiprocess"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "MAngMASGNK97"
   },
   "outputs": [],
   "source": [
    "# We wrap the package so that it has the familiar sklearn API\n",
    "import hdmpy\n",
    "\n",
    "\n",
    "class RLasso(BaseEstimator):\n",
    "\n",
    "    def __init__(self, *, post=True):\n",
    "        self.post = post\n",
    "\n",
    "    def fit(self, X, y):\n",
    "        self.rlasso_ = hdmpy.rlasso(X, y, post=self.post)\n",
    "        return self\n",
    "\n",
    "    def predict(self, X):\n",
    "        return X @ np.array(self.rlasso_.est['beta']).flatten() + self.rlasso_.est['intercept'].values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Qa5DMdQWNK97"
   },
   "outputs": [],
   "source": [
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "rsquares = cross_val_score(RLasso(), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)\n",
    "print('Flexible model R^2 (RLasso): ', f'{np.mean(rsquares):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "YDfzTw59NK98"
   },
   "source": [
    "Finally, we try the combination of a sparse and a dense coefficient using the LAVA method"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "fI_DjDtvNK98"
   },
   "outputs": [],
   "source": [
    "# We construct an sklearn API estimator that implements the LAVA method\n",
    "\n",
    "class Lava(BaseEstimator):\n",
    "\n",
    "    def __init__(self, *, alpha2=1, iterations=3):\n",
    "        self.alpha2 = alpha2\n",
    "        self.iterations = iterations\n",
    "\n",
    "    def fit(self, X, y):\n",
    "        lasso = RLasso(post=False).fit(X, y)\n",
    "        ridge = Ridge(self.alpha2).fit(X, y - lasso.predict(X).flatten())\n",
    "\n",
    "        for _ in range(self.iterations - 1):\n",
    "            lasso = lasso.fit(X, y - ridge.predict(X))\n",
    "            ridge = ridge.fit(X, y - lasso.predict(X).flatten())\n",
    "\n",
    "        self.lasso_ = lasso\n",
    "        self.ridge_ = ridge\n",
    "        return self\n",
    "\n",
    "    def predict(self, X):\n",
    "        return self.lasso_.predict(X) + self.ridge_.predict(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "pU2NAg9WNK99"
   },
   "outputs": [],
   "source": [
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "rsquares = cross_val_score(Lava(alpha2=20), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)\n",
    "print('Flexible model R^2 (LAVA): ', f'{np.mean(rsquares):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1uKW4L69NK99"
   },
   "source": [
    "<!-- We find that for this dataset the low dimensional OLS was the best among all specifications. The high-dimensional approaches did not manage to increase predictive power. -->\n",
    "We find that for this dataset the low dimensional OLS is sufficient. The high-dimensional approaches did not manage to substantively increase predictive power."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "BFKIU9ewUAzM"
   },
   "source": [
    "### Extra high-dimensional specification (extra flexible)\n",
    "\n",
    "We repeat the same procedure for the extra flexible model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "rdKJgr9AUgqI"
   },
   "outputs": [],
   "source": [
    "Zextra = patsy.dmatrix('0 + sex + (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+C(occ2)+C(ind2)+mw+so+we)**2',\n",
    "                       Z, return_type='dataframe').values\n",
    "Zextra.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "jKbHAPXYUotO"
   },
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(Zextra, y, test_size=0.25, random_state=123)\n",
    "lr_extra = LinearRegression().fit(X_train, y_train)\n",
    "print(f'{r2_score(y_test, lr_extra.predict(X_test)):.4f}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Gfa5uUiiUowQ"
   },
   "outputs": [],
   "source": [
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "rsquares = cross_val_score(StatsModelsOLS(), Zextra, y, scoring='r2', cv=cv, n_jobs=-1)\n",
    "print(f'{np.mean(rsquares):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "k_rRXS1FYG_R"
   },
   "source": [
    "#### Penalized regressions (extra flexible model)\n",
    "\n",
    "Now let's repeat our penalized regression analysis for the extra flexible model. Note this block takes a while!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "oWgvAY1RYJXV"
   },
   "outputs": [],
   "source": [
    "Zextra = StandardScaler().fit_transform(Zextra)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "YEvTN7kpYMKD"
   },
   "outputs": [],
   "source": [
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "rsquares = cross_val_score(LassoCV(cv=cv), Zextra, y, scoring='r2', cv=cv, n_jobs=-1)\n",
    "print('Extra flexible model R^2 (Lasso): ', f'{np.mean(rsquares):.4f}')\n",
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "rsquares = cross_val_score(RidgeCV(cv=cv), Zextra, y, scoring='r2', cv=cv, n_jobs=-1)\n",
    "print('Extra flexible model R^2 (Ridge): ', f'{np.mean(rsquares):.4f}')\n",
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "rsquares = cross_val_score(ElasticNetCV(cv=cv), Zextra, y, scoring='r2', cv=cv, n_jobs=-1)\n",
    "print('Extra flexible model R^2 (Elastic Net):', f'{np.mean(rsquares):.4f}')\n",
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "rsquares = cross_val_score(RLasso(), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)\n",
    "print('Extra flexible model R^2 (RLasso):', f'{np.mean(rsquares):.4f}')\n",
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "rsquares = cross_val_score(Lava(alpha2=20), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)\n",
    "print('Extra flexible model R^2 (LAVA):', f'{np.mean(rsquares):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pCyYcqqlxMz3"
   },
   "source": [
    "As shown above, the overfitting effect is mitigated with the penalized regression model."
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
